Udacity Data Scientist Capstone Project¶

"Client Engagement" Predicting Airline Passenger Satisfaction¶

Prepared by
Chris Bacani
Data Scientist Apprentice at IBM

Table of Contents¶

  1. Introduction
  2. Initial Data Cleaning
  3. Exploratory Analysis
    3.1 Multivariate Analysis
  4. Data Preparation
  5. Modelling
    5.1 Model Candidate Pipelines
    5.2 Comparison of Output Metrics
  6. Deployment and Monitoring
  7. Conclusions

1. Introduction¶

Business Problem¶

Our objective is to identify factors that contribute to passenger satisfaction or dissatisfaction, so that the airline (client) can address them to enhance customer service and improve customer retention.

Abstract¶

I decided to take on a project involving airline customer service data because of my long-standing passion for aviation, particularly commercial aviation. One of my primary goals as I grow in my role at IBM is to help airlines, cargo carriers, and airports solve problems through innovative, data-informed solutions.

With air travel beginning to resume after the grinding halt brought about by the COVID-19 pandemic, attention has returned to customer satisfaction - in most cases as a response to reduced or altered cabin service or to the public health policies introduced by the carriers. However, passenger satisfaction was shaky even before the pandemic. Airlines, particularly in the United States, have struggled to balance service and value, often prioritizing one at the expense of the other.

As an apprentice on IBM's Data Science Elite team, a huge part of our curriculum is participating in projects that solve data science problems for our clients. This project is meant to mirror a typical end-to-end engagement we would complete for a client: from data ingestion, to cleaning and analysis, to modeling, to deployment and explainability.


This use case is one that I look forward to taking on for an air travel client in the future: analyzing their customer service data and passing it through a machine learning model to identify and verify the factors that contribute to passenger satisfaction.

Dataset Information¶

I found this dataset on Kaggle.com while searching for datasets detailing passenger satisfaction. The data has been pre-split into train and test sets, and a target column has already been engineered, so less cleaning and manipulation is required.

Recognition of Limitations¶

Many of the aspects that make up customer service data have been obfuscated to make this dataset more abstract. Information like airline carrier, origin airport, travel date, price, offers, and travel time has been made unavailable - all of which could, and would, have an impact on a customer satisfaction score.

Preliminary Dependency Imports¶

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as sp
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

Graphing Parameters¶

In [2]:
from matplotlib import rcParams

# Font Size
rcParams['font.size'] = 12
# Figure Size
rcParams['figure.figsize'] = 7, 5

Data Imports¶

Training data import¶
In [3]:
air_train_df = pd.read_csv('./Data/air-train.csv')
In [4]:
air_train_df.head()
Out[4]:
Unnamed: 0 id Gender Customer Type Age Type of Travel Class Flight Distance Inflight wifi service Departure/Arrival time convenient ... Inflight entertainment On-board service Leg room service Baggage handling Checkin service Inflight service Cleanliness Departure Delay in Minutes Arrival Delay in Minutes satisfaction
0 0 70172 Male Loyal Customer 13 Personal Travel Eco Plus 460 3 4 ... 5 4 3 4 4 5 5 25 18.0 neutral or dissatisfied
1 1 5047 Male disloyal Customer 25 Business travel Business 235 3 2 ... 1 1 5 3 1 4 1 1 6.0 neutral or dissatisfied
2 2 110028 Female Loyal Customer 26 Business travel Business 1142 2 2 ... 5 4 3 4 4 4 5 0 0.0 satisfied
3 3 24026 Female Loyal Customer 25 Business travel Business 562 2 5 ... 2 2 5 3 1 4 2 11 9.0 neutral or dissatisfied
4 4 119299 Male Loyal Customer 61 Business travel Business 214 3 3 ... 3 3 4 4 3 3 3 0 0.0 satisfied

5 rows × 25 columns

In [5]:
air_train_df.satisfaction.value_counts()
Out[5]:
neutral or dissatisfied    58879
satisfied                  45025
Name: satisfaction, dtype: int64
Test data import¶
In [6]:
air_test_df = pd.read_csv('./Data/air-test.csv')
In [7]:
air_test_df.head()
Out[7]:
Unnamed: 0 id Gender Customer Type Age Type of Travel Class Flight Distance Inflight wifi service Departure/Arrival time convenient ... Inflight entertainment On-board service Leg room service Baggage handling Checkin service Inflight service Cleanliness Departure Delay in Minutes Arrival Delay in Minutes satisfaction
0 0 19556 Female Loyal Customer 52 Business travel Eco 160 5 4 ... 5 5 5 5 2 5 5 50 44.0 satisfied
1 1 90035 Female Loyal Customer 36 Business travel Business 2863 1 1 ... 4 4 4 4 3 4 5 0 0.0 satisfied
2 2 12360 Male disloyal Customer 20 Business travel Eco 192 2 0 ... 2 4 1 3 2 2 2 0 0.0 neutral or dissatisfied
3 3 77959 Male Loyal Customer 44 Business travel Business 3377 0 0 ... 1 1 1 1 3 1 4 0 6.0 satisfied
4 4 36875 Female Loyal Customer 49 Business travel Eco 1182 2 3 ... 2 2 2 2 4 2 4 0 20.0 satisfied

5 rows × 25 columns

Back to Top

2. Initial Data Cleaning¶

Examining initial dataframes¶

In [8]:
# Training dataset
air_train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      103904 non-null  int64  
 12  Food and drink                     103904 non-null  int64  
 13  Online boarding                    103904 non-null  int64  
 14  Seat comfort                       103904 non-null  int64  
 15  Inflight entertainment             103904 non-null  int64  
 16  On-board service                   103904 non-null  int64  
 17  Leg room service                   103904 non-null  int64  
 18  Baggage handling                   103904 non-null  int64  
 19  Checkin service                    103904 non-null  int64  
 20  Inflight service                   103904 non-null  int64  
 21  Cleanliness                        103904 non-null  int64  
 22  Departure Delay in Minutes         103904 non-null  int64  
 23  Arrival Delay in Minutes           103594 non-null  float64
 24  satisfaction                       103904 non-null  object 
dtypes: float64(1), int64(19), object(5)
memory usage: 19.8+ MB
In [9]:
# Test dataset
air_test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         25976 non-null  int64  
 1   id                                 25976 non-null  int64  
 2   Gender                             25976 non-null  object 
 3   Customer Type                      25976 non-null  object 
 4   Age                                25976 non-null  int64  
 5   Type of Travel                     25976 non-null  object 
 6   Class                              25976 non-null  object 
 7   Flight Distance                    25976 non-null  int64  
 8   Inflight wifi service              25976 non-null  int64  
 9   Departure/Arrival time convenient  25976 non-null  int64  
 10  Ease of Online booking             25976 non-null  int64  
 11  Gate location                      25976 non-null  int64  
 12  Food and drink                     25976 non-null  int64  
 13  Online boarding                    25976 non-null  int64  
 14  Seat comfort                       25976 non-null  int64  
 15  Inflight entertainment             25976 non-null  int64  
 16  On-board service                   25976 non-null  int64  
 17  Leg room service                   25976 non-null  int64  
 18  Baggage handling                   25976 non-null  int64  
 19  Checkin service                    25976 non-null  int64  
 20  Inflight service                   25976 non-null  int64  
 21  Cleanliness                        25976 non-null  int64  
 22  Departure Delay in Minutes         25976 non-null  int64  
 23  Arrival Delay in Minutes           25893 non-null  float64
 24  satisfaction                       25976 non-null  object 
dtypes: float64(1), int64(19), object(5)
memory usage: 5.0+ MB
In [10]:
# Checking values of scored columns, using cleanliness as score to test
air_train_df.Cleanliness.value_counts()
Out[10]:
4    27179
3    24574
5    22689
2    16132
1    13318
0       12
Name: Cleanliness, dtype: int64

Observations¶

Unneeded columns¶

The columns Unnamed: 0 and id are of no use to us, so they can be dropped right away. For the sake of efficiency, we'll apply this dataframe manipulation to both the training and testing datasets.

Datatype corrections¶

Aside from the 5 categorical variables that need to be converted before model ingestion, there is a data type mismatch in the delay columns: the arrival column is a float while the departure column is an integer. Whether or not it is strictly necessary, I will cast both to floats for consistency when performing further analyses.

Null fills¶

In the training dataset, there are 310 missing values in the column detailing the on-time performance of a particular flight (Arrival Delay in Minutes); the test set is missing 83 values in the same column.

Since we cannot accurately decide whether or not a null value indicates that the trip was not delayed, we will fill this with the mean value of the column.
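On a toy column, the mean fill behaves as follows (a minimal sketch with made-up numbers, not the project data):

```python
import pandas as pd

# Hypothetical delay column with two missing entries
delays = pd.Series([0.0, 12.0, None, 30.0, None])

# Fill nulls with the column mean (here (0 + 12 + 30) / 3 = 14.0)
filled = delays.fillna(delays.mean())

print(filled.tolist())  # [0.0, 12.0, 14.0, 30.0, 14.0]
```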

Zero values in ranked value columns¶

In examining the dataframes, I can see rows with a value of 0 in columns that should be ranked on a scale from 1 to 5. For simplicity's sake, I will drop rows where a 0 value exists. This reduces noise in our data set and makes it easier for the model to ingest later on.
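As a quick illustration of the row drop on a toy frame (column values are made up):

```python
import pandas as pd

# Toy frame: ratings should be 1-5, so a 0 marks an invalid response
toy = pd.DataFrame({'seat_comfort': [3, 0, 5, 1],
                    'cleanliness':  [4, 2, 0, 5]})

# Drop any row where a ranked column holds a 0
for col in ['seat_comfort', 'cleanliness']:
    toy.drop(toy.loc[toy[col] == 0].index, inplace=True)

print(len(toy))  # 2 rows survive
```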

Cleaning the dataframes¶

In [11]:
def clean_data(orig_df):
    '''
    This function applies 7 steps to the dataframe to clean the data.
    1. Dropping unnecessary columns
    2. Uniformizing datatypes in the delay columns
    3. Normalizing column names
    4. Normalizing text values in columns
    5. Imputing numeric null values with the mean value of the column
    6. Dropping "zero" values from ranked categorical variables
    7. Creating aggregated flight delay columns
    
    Return: Cleaned DataFrame, ready for analysis - final encoding still to be applied.
    
    ''' 
    
    df = orig_df.copy()
    
    '''1. Dropping unnecessary columns'''
    df.drop(['Unnamed: 0', 'id'], axis = 1, inplace = True)
    
    '''2. Uniformizing datatype in delay column'''
    df['Departure Delay in Minutes'] = df['Departure Delay in Minutes'].astype(float)
    
    '''3. Normalizing column names'''
    df.columns = df.columns.str.lower()

    '''Replacing spaces and other characters with underscores. This makes the
    columns easier to work with and lets us call them using dot notation.'''
    special_chars = "/ -" 
    for special_char in special_chars:
        df.columns = [col.replace(special_char, '_') for col in df.columns]
    
    '''4. Normalizing text values in columns'''
    cat_cols = ['gender', 'customer_type', 'class', 'type_of_travel', 'satisfaction']

    for column in cat_cols:
        df[column] = df[column].str.lower() 
        
    '''5. Imputing the nulls in the arrival delay column with the mean.
    Since we cannot safely equate these nulls to a zero value, the mean value of the column is the
    most sensible method of replacement.'''
    df['arrival_delay_in_minutes'].fillna(df['arrival_delay_in_minutes'].mean(), inplace = True)
    df = df.round({'arrival_delay_in_minutes' : 1})
    
    '''6. Dropping rows from ranked value columns where "zero" exists as a value.
    Since these columns are meant to be ranked on a scale from 1 to 5, having zero as a value 
    does not make sense nor does it help us in any way.'''
    rank_list = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking", "gate_location",
                "food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment", "on_board_service",
                "leg_room_service", "baggage_handling", "checkin_service", "inflight_service", "cleanliness"]
    
    for col in rank_list:
        df.drop(df.loc[df[col] == 0].index, inplace = True)
    
    '''7. Creating aggregated and categorical flight delay columns'''
    df['total_delay_time'] = (df['departure_delay_in_minutes'] + df['arrival_delay_in_minutes'])
    df['was_flight_delayed'] = np.where(df['total_delay_time'] > 0, 'yes', 'no')
    
    cleaned_df = df
    
    return cleaned_df

Applying the transformations to our training dataset¶

In [12]:
air_train_cleaned = clean_data(air_train_df)
In [13]:
air_train_cleaned.head()
Out[13]:
gender customer_type age type_of_travel class flight_distance inflight_wifi_service departure_arrival_time_convenient ease_of_online_booking gate_location ... leg_room_service baggage_handling checkin_service inflight_service cleanliness departure_delay_in_minutes arrival_delay_in_minutes satisfaction total_delay_time was_flight_delayed
0 male loyal customer 13 personal travel eco plus 460 3 4 3 1 ... 3 4 4 5 5 25.0 18.0 neutral or dissatisfied 43.0 yes
1 male disloyal customer 25 business travel business 235 3 2 3 3 ... 5 3 1 4 1 1.0 6.0 neutral or dissatisfied 7.0 yes
2 female loyal customer 26 business travel business 1142 2 2 2 2 ... 3 4 4 4 5 0.0 0.0 satisfied 0.0 no
3 female loyal customer 25 business travel business 562 2 5 5 5 ... 5 3 1 4 2 11.0 9.0 neutral or dissatisfied 20.0 yes
4 male loyal customer 61 business travel business 214 3 3 3 3 ... 4 4 3 3 3 0.0 0.0 satisfied 0.0 no

5 rows × 25 columns

Applying to test set¶

In [14]:
air_test_cleaned = clean_data(air_test_df)

Back to Top

3. Exploratory Analysis¶

What proportion of customers are satisfied or dissatisfied?¶

Here I am checking for imbalance in the outcomes of our prediction column. This will determine whether additional steps (resampling, downsampling, etc.) are needed before moving on.

In [15]:
fig = plt.figure(figsize = (10,7))
air_train_cleaned.satisfaction.value_counts(normalize = True).plot(kind='bar', alpha = 0.9, rot=0)
plt.title('Customer satisfaction')
plt.ylabel('Percent')
plt.show()

Observations : The prediction classes are not completely even - the ratio of dissatisfied to satisfied customers is roughly 55 to 45. However, I would not say that the data presents itself as imbalanced.

I can say that the data does not require additional treatment or resampling.
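A quick way to quantify that judgment is with normalized value counts. A minimal sketch on a synthetic target with the same 55/45 shape (the real proportions come from `air_train_cleaned`):

```python
import pandas as pd

# Synthetic target mirroring the shape of our real satisfaction column
target = pd.Series(['neutral or dissatisfied'] * 55 + ['satisfied'] * 45)

ratio = target.value_counts(normalize=True)
print(ratio['neutral or dissatisfied'])  # 0.55

# Rule of thumb used here: treat splits milder than ~60/40 as balanced enough
assert ratio.max() < 0.60
```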

What do customers currently think about their travel service?¶

What do the categorical variables look like? What is the spread of responses across the dataset?

In [16]:
from matplotlib.pyplot import figure

categoricals = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking", 
                "gate_location", "food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment", 
                "on_board_service", "leg_room_service", "baggage_handling", "checkin_service", 
                "inflight_service", "cleanliness"]

air_train_cleaned.hist(column = categoricals, layout=(4,4), label='x', figsize = (20,20));

Which groups are shown to be more satisfied with their travel experience? Men or women? Loyalty customers or casual fliers?¶

In [17]:
with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "gender",  hue= 'satisfaction', data = air_train_cleaned,  
                     stat = 'percent', multiple="dodge", palette = 'Set1')
In [18]:
with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "customer_type",  hue= 'satisfaction', data = air_train_cleaned, 
                     stat = 'percent', multiple="dodge", palette = 'Set1')

Observations : The gender groups are roughly equally likely to be either satisfied or dissatisfied; the slight skew simply reflects the proportions of the target variable.

Loyal customers dramatically outnumber non-loyal customers. However, in both groups dissatisfied customers make up the majority class.

The ratio of positive to negative target outcomes is essentially the same in both groups, which suggests that neither gender nor customer type does much on its own to separate satisfied from dissatisfied customers.

While I'm not discounting these variables outright as far as feature importance is concerned, I don't believe that there are any insights to be gained from further analysis of these two variables.

How do passengers in different cabins feel about their overall travel experience?¶

In [19]:
with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "class",  hue= 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple="dodge", palette = 'Set1')

Observations : Passengers who travel in a business class cabin are far more likely to finish their trip with a positive impression than are those who fly in economy or economy plus/premium economy.

The dramatic difference in satisfaction between fliers in a business class cabin versus an economy type cabin suggests that there may be characteristic differences here informing these distributions. For that reason I'd like to examine this column further.

Does the reason for travel have an impact as to whether or not a passenger is satisfied with their service?¶

In [20]:
with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "type_of_travel",  hue= 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple="dodge", palette = 'Set1')

Observations : Passengers travelling for business seem far more likely to be satisfied with the experience of their trip than those who travel for personal reasons.

In both groups of this column, there are dramatic and inverse proportions regarding the positive and negative outcomes of the target variable — satisfied customers vs dissatisfied or neutral.

This is another column where I think we can find great insights from further analyses.

Are certain age groups more prone to be satisfied/dissatisfied with their travel experience?¶

In [21]:
# Countplot comparing age to satisfaction
with sns.axes_style('white'):
    g = sns.catplot(x = 'age', data = air_train_cleaned,  
                    kind = 'count', hue = 'satisfaction', order = range(7, 80),
                    height = 8.27, aspect=18.7/8.27, legend = False,
                   palette = 'Set1')
    
plt.legend(loc='upper right');
In [22]:
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "age", palette='Set1')
Out[22]:
<AxesSubplot:xlabel='satisfaction', ylabel='age'>

Observations : Customers across most age groups are generally dissatisfied/neutral regarding their trip experience. However, customers ranging from roughly age 38 to 61 are generally satisfied with their trip experience.

As with the previous two columns, the discrepancy points to potential insights to be unearthed by conducting further analyses.

Comparing satisfaction across different ages in cabin flown¶

In [23]:
sns.violinplot(data = air_train_cleaned, x = "class", y = "age", 
               hue = 'satisfaction', palette = 'Set1', split = True)
Out[23]:
<AxesSubplot:xlabel='class', ylabel='age'>

Does the distance traveled on a flight have any impact as to whether or not a customer is going to be satisfied?¶

In [24]:
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "flight_distance", palette = 'Set1')
Out[24]:
<AxesSubplot:xlabel='satisfaction', ylabel='flight_distance'>
In [25]:
sns.violinplot(data = air_train_cleaned, x = "class", y = "flight_distance", 
               hue = 'satisfaction', palette = 'Set1', split = True)
Out[25]:
<AxesSubplot:xlabel='class', ylabel='flight_distance'>

How much of a factor is on-time performance in determining satisfaction?¶

In [26]:
fig, axes = plt.subplots(1, 2)

sns.violinplot(data = air_train_cleaned, x = "satisfaction", 
               y = "departure_delay_in_minutes", ax = axes[0], palette = 'Set1')
sns.violinplot(data = air_train_cleaned, x = "satisfaction",
               y = "arrival_delay_in_minutes", ax = axes[1], palette = 'Set1')
Out[26]:
<AxesSubplot:xlabel='satisfaction', ylabel='arrival_delay_in_minutes'>
Aggregated column¶
In [27]:
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "total_delay_time", palette = 'Set1')
Out[27]:
<AxesSubplot:xlabel='satisfaction', ylabel='total_delay_time'>

Observations : With the majority of flights not having significant delays, I don't believe this is a particularly strong factor in determining passenger satisfaction.

Back to Top

In [28]:
air_train_cleaned.was_flight_delayed.value_counts()
Out[28]:
yes    52436
no     43268
Name: was_flight_delayed, dtype: int64

3.1 Multivariate Analysis¶

Rated Score Columns¶

In [29]:
score_cols = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking", 
              "gate_location","food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment", 
              "on_board_service","leg_room_service", "baggage_handling", "checkin_service", "inflight_service",
              "cleanliness"]

How do satisfied and dissatisfied customers score different parts of their travel experience?¶

In [30]:
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)

# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)

    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = 'satisfaction', 
                   y =  score_col, 
                   ax = ax,
                   palette = 'Set1')

    # chart formatting
    ax.set_title(score_col),
    ax.set_xlabel("")

Observation : Low scores for in-flight wi-fi and ease of online booking are strongly associated with passenger dissatisfaction. Meanwhile, dissatisfaction is influenced more by middle-to-high scores for gate location, baggage handling, check-in, and in-flight customer service.

High marks in nearly all categories show a clear association with customer satisfaction.

Does age affect rated aspects of airline service?¶

In [31]:
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)

# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)

    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = score_col, 
                   y = 'age', 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')

    # Chart formatting
    ax.set_title(score_col),
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

Observations : For many of these customer service categories, the individual distributions of satisfaction vs dissatisfaction across each rating for customer service directly mirror the bivariate distributions of age vs satisfaction. There is a noticeable peak in the 37-60 age group across most of these distributions.

Certain customer service variables (in-flight customer service, baggage handling, leg room, on-board service quality, and inflight entertainment) show passengers measured as satisfied despite scoring the service column poorly. In the leg room category, a large number of younger travelers are dissatisfied despite scoring the column highly.

There are interesting distributions in the online boarding category - where the 40-60 age group is measured as satisfied despite giving the service column the lowest possible score, and dissatisfied despite giving the service column the highest possible score.

I can't see any trends that explicitly illustrate age as a factor with impact on customer satisfaction.

Does flight distance affect rated aspects of airline service?¶

In [32]:
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)

# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)

    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = score_col, 
                   y = 'flight_distance', 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')

    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

Observations : While many of these features show equal distributions of satisfied and dissatisfied customers across all flight distances, a few points suggest that certain features have more of an impact on satisfaction.

There is a higher rate of satisfaction for customers who provided the highest mark for in-flight wi-fi service and online booking ease.

Does the passenger cabin traveled affect rated aspects of airline service?¶

In [33]:
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)

# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)

    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = 'class', 
                   y = score_col, 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')

    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

Observations : These discrete distributions tell me that cabin class is going to be a large determinant of passenger satisfaction. There are clear peaks in satisfaction and dissatisfaction for in-flight wi-fi service, online boarding, seat comfort, in-flight entertainment, on-board customer service, leg room, and in-flight customer service.

Online boarding and cabin cleanliness appear to be important factors in driving satisfaction regardless of the cabin class in which the passenger is traveling.

On-board service quality, leg room, baggage handling, check-in service, and inflight customer service appear to be important factors informing passenger satisfaction for business class passengers.

An interesting plot is the in-flight wi-fi service column. High ratings appear to have a high impact driving satisfaction for customers flying in economy plus and economy classes. It doesn't appear to have as much of an impact on satisfaction for business class travelers, but mediocre or low ratings appear to have a high impact on dissatisfaction for travelers in this cabin.

Does the reason for travel affect rated aspects of airline service?¶

In [34]:
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)

# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)

    # Filter df and plot scored column on new axis
    sns.violinplot(data = air_train_cleaned, 
                   x = 'type_of_travel', 
                   y = score_col, 
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')

    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")

Observations : Broken out by purpose of travel, the distributions across these service columns point to a strong influence on customer satisfaction.

Similar to the trend we observed regarding cabin class, high ratings for the boarding process have a strong impact on satisfaction. Inflight wi-fi service also has a strong impact on customer satisfaction, though we can observe an almost inverse effect when considering the purpose of travel: for personal travelers, high ratings drive satisfaction, while business travelers seem satisfied regardless of rating. However, low in-flight wi-fi scores from business travelers do drive dissatisfaction.

We can also observe the same drivers of satisfaction for business travelers (business purpose and business cabin). These being inflight customer service, cleanliness, check-in service, baggage handling, legroom, and on-board service quality.

Back to Top

4. Data Preparation¶

Encoding the dataframe to be ready for model ingestion¶

In [35]:
air_train_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95704 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   gender                             95704 non-null  object 
 1   customer_type                      95704 non-null  object 
 2   age                                95704 non-null  int64  
 3   type_of_travel                     95704 non-null  object 
 4   class                              95704 non-null  object 
 5   flight_distance                    95704 non-null  int64  
 6   inflight_wifi_service              95704 non-null  int64  
 7   departure_arrival_time_convenient  95704 non-null  int64  
 8   ease_of_online_booking             95704 non-null  int64  
 9   gate_location                      95704 non-null  int64  
 10  food_and_drink                     95704 non-null  int64  
 11  online_boarding                    95704 non-null  int64  
 12  seat_comfort                       95704 non-null  int64  
 13  inflight_entertainment             95704 non-null  int64  
 14  on_board_service                   95704 non-null  int64  
 15  leg_room_service                   95704 non-null  int64  
 16  baggage_handling                   95704 non-null  int64  
 17  checkin_service                    95704 non-null  int64  
 18  inflight_service                   95704 non-null  int64  
 19  cleanliness                        95704 non-null  int64  
 20  departure_delay_in_minutes         95704 non-null  float64
 21  arrival_delay_in_minutes           95704 non-null  float64
 22  satisfaction                       95704 non-null  object 
 23  total_delay_time                   95704 non-null  float64
 24  was_flight_delayed                 95704 non-null  object 
dtypes: float64(3), int64(16), object(6)
memory usage: 19.0+ MB
In [36]:
air_train_cleaned.head()
Out[36]:
gender customer_type age type_of_travel class flight_distance inflight_wifi_service departure_arrival_time_convenient ease_of_online_booking gate_location ... leg_room_service baggage_handling checkin_service inflight_service cleanliness departure_delay_in_minutes arrival_delay_in_minutes satisfaction total_delay_time was_flight_delayed
0 male loyal customer 13 personal travel eco plus 460 3 4 3 1 ... 3 4 4 5 5 25.0 18.0 neutral or dissatisfied 43.0 yes
1 male disloyal customer 25 business travel business 235 3 2 3 3 ... 5 3 1 4 1 1.0 6.0 neutral or dissatisfied 7.0 yes
2 female loyal customer 26 business travel business 1142 2 2 2 2 ... 3 4 4 4 5 0.0 0.0 satisfied 0.0 no
3 female loyal customer 25 business travel business 562 2 5 5 5 ... 5 3 1 4 2 11.0 9.0 neutral or dissatisfied 20.0 yes
4 male loyal customer 61 business travel business 214 3 3 3 3 ... 4 4 3 3 3 0.0 0.0 satisfied 0.0 no

5 rows × 25 columns

In [37]:
from sklearn.preprocessing import OrdinalEncoder
In [38]:
def encode_data(orig_df):
    '''
    Encodes the remaining categorical variables of a dataframe to be ready for model ingestion
    
    Inputs:
        Dataframe
        
    Manipulations:
        Ordinal encoding of scored rating columns, replacement of binary
        categories, and one-hot encoding of the travel class column.
    
    Return: 
        Encoded dataframe
    '''
    
    df = orig_df.copy()
    
    # Ordinal encode the scored rating columns (score_cols is defined earlier in the notebook)
    encoder = OrdinalEncoder()
    
    for j in score_cols:
        df[j] = encoder.fit_transform(df[[j]]) 
    
    # Replace binary categories with 0/1
    df.was_flight_delayed.replace({'no': 0, 'yes': 1}, inplace = True)
    df['satisfaction'].replace({'neutral or dissatisfied': 0, 'satisfied': 1}, inplace = True)
    df.customer_type.replace({'disloyal customer': 0, 'loyal customer': 1}, inplace = True)
    df.type_of_travel.replace({'personal travel': 0, 'business travel': 1}, inplace = True)
    df.gender.replace({'male': 0, 'female': 1}, inplace = True)
    
    # One-hot encode the travel class column
    encoded_df = pd.get_dummies(df, columns = ['class'])
    
    return encoded_df
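One caveat worth flagging: `OrdinalEncoder` is fit per column, so the learned mapping depends on which values actually appear in that column, and fitting it separately on the train and test splits (as `encode_data` does) can yield inconsistent codes if either split is missing a rating level. A minimal sketch of the mapping, on made-up rating values rather than the notebook's data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Full 1-5 scale present: codes are assigned in sorted order of observed values
full = OrdinalEncoder().fit_transform(pd.DataFrame({'rating': [1, 2, 3, 4, 5]}))

# Level 3 absent: the codes for ratings 4 and 5 silently shift down by one
missing = OrdinalEncoder().fit_transform(pd.DataFrame({'rating': [1, 2, 4, 5]}))

print(full.ravel())     # rating 4 -> code 3.0
print(missing.ravel())  # rating 4 -> code 2.0
```

Fitting the encoder once on the training split and reusing it via `transform` on the test split avoids this mismatch.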
In [39]:
# Applying encoding to training dataset
air_train_encoded = encode_data(air_train_cleaned)
In [40]:
air_train_encoded.cleanliness.value_counts()
Out[40]:
3.0    25294
2.0    22634
4.0    20928
1.0    14727
0.0    12121
Name: cleanliness, dtype: int64
In [41]:
air_train_encoded.head()
Out[41]:
gender customer_type age type_of_travel flight_distance inflight_wifi_service departure_arrival_time_convenient ease_of_online_booking gate_location food_and_drink ... inflight_service cleanliness departure_delay_in_minutes arrival_delay_in_minutes satisfaction total_delay_time was_flight_delayed class_business class_eco class_eco plus
0 0 1 13 0 460 2.0 3.0 2.0 0.0 4.0 ... 4.0 4.0 25.0 18.0 0 43.0 1 0 0 1
1 0 0 25 1 235 2.0 1.0 2.0 2.0 0.0 ... 3.0 0.0 1.0 6.0 0 7.0 1 1 0 0
2 1 1 26 1 1142 1.0 1.0 1.0 1.0 4.0 ... 3.0 4.0 0.0 0.0 1 0.0 0 1 0 0
3 1 1 25 1 562 1.0 4.0 4.0 4.0 1.0 ... 3.0 1.0 11.0 9.0 0 20.0 1 1 0 0
4 0 1 61 1 214 2.0 2.0 2.0 2.0 3.0 ... 2.0 2.0 0.0 0.0 1 0.0 0 1 0 0

5 rows × 27 columns

In [42]:
air_train_encoded.satisfaction.value_counts()
Out[42]:
0    54947
1    40757
Name: satisfaction, dtype: int64
In [43]:
type(air_train_encoded)
Out[43]:
pandas.core.frame.DataFrame
In [44]:
air_train_encoded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95704 entries, 0 to 103903
Data columns (total 27 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   gender                             95704 non-null  int64  
 1   customer_type                      95704 non-null  int64  
 2   age                                95704 non-null  int64  
 3   type_of_travel                     95704 non-null  int64  
 4   flight_distance                    95704 non-null  int64  
 5   inflight_wifi_service              95704 non-null  float64
 6   departure_arrival_time_convenient  95704 non-null  float64
 7   ease_of_online_booking             95704 non-null  float64
 8   gate_location                      95704 non-null  float64
 9   food_and_drink                     95704 non-null  float64
 10  online_boarding                    95704 non-null  float64
 11  seat_comfort                       95704 non-null  float64
 12  inflight_entertainment             95704 non-null  float64
 13  on_board_service                   95704 non-null  float64
 14  leg_room_service                   95704 non-null  float64
 15  baggage_handling                   95704 non-null  float64
 16  checkin_service                    95704 non-null  float64
 17  inflight_service                   95704 non-null  float64
 18  cleanliness                        95704 non-null  float64
 19  departure_delay_in_minutes         95704 non-null  float64
 20  arrival_delay_in_minutes           95704 non-null  float64
 21  satisfaction                       95704 non-null  int64  
 22  total_delay_time                   95704 non-null  float64
 23  was_flight_delayed                 95704 non-null  int64  
 24  class_business                     95704 non-null  uint8  
 25  class_eco                          95704 non-null  uint8  
 26  class_eco plus                     95704 non-null  uint8  
dtypes: float64(17), int64(7), uint8(3)
memory usage: 18.5 MB
In [45]:
# Applying encoding to test dataset
air_test_encoded = encode_data(air_test_cleaned)
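A related caveat: because `encode_data` calls `pd.get_dummies` on the train and test frames independently, a class value absent from one split would silently drop (or reorder) a dummy column. A defensive sketch, where the `align_columns` helper is hypothetical and not part of the notebook:

```python
import pandas as pd

def align_columns(train_df, test_df):
    """Ensure the test frame has exactly the train columns, filling missing dummies with 0."""
    return test_df.reindex(columns=train_df.columns, fill_value=0)

# Toy example: the test frame is missing dummy column 'b'
train = pd.DataFrame({'a': [1], 'b': [2]})
test = pd.DataFrame({'a': [3]})
aligned = align_columns(train, test)
print(list(aligned.columns))  # ['a', 'b']
```

In this dataset both splits happen to contain all three travel classes, so the columns line up, but the check is cheap insurance.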

Correlation of all features against target variable¶

In [46]:
train_corr = air_train_encoded.corr()[['satisfaction']]

plt.figure(figsize=(10, 12))

heatmap = sns.heatmap(train_corr.sort_values(by='satisfaction', ascending=False), 
                      vmin=-1, vmax=1, annot=True, cmap='Blues')

heatmap.set_title('Feature Correlation with Target Variable', fontdict={'fontsize':14});

Feature Selection¶

In [47]:
# Pre-processing and scaling dataset for feature selection
from sklearn import preprocessing

r_scaler = preprocessing.MinMaxScaler()
r_scaler.fit(air_train_encoded)
 
air_train_scaled = pd.DataFrame(r_scaler.transform(air_train_encoded), columns = air_train_encoded.columns)
air_train_scaled.head()
Out[47]:
gender customer_type age type_of_travel flight_distance inflight_wifi_service departure_arrival_time_convenient ease_of_online_booking gate_location food_and_drink ... inflight_service cleanliness departure_delay_in_minutes arrival_delay_in_minutes satisfaction total_delay_time was_flight_delayed class_business class_eco class_eco plus
0 0.0 1.0 0.076923 0.0 0.086632 0.50 0.75 0.50 0.00 1.00 ... 1.00 1.00 0.015704 0.011364 0.0 0.013539 1.0 0.0 0.0 1.0
1 0.0 0.0 0.230769 1.0 0.041195 0.50 0.25 0.50 0.50 0.00 ... 0.75 0.00 0.000628 0.003788 0.0 0.002204 1.0 1.0 0.0 0.0
2 1.0 1.0 0.243590 1.0 0.224354 0.25 0.25 0.25 0.25 1.00 ... 0.75 1.00 0.000000 0.000000 1.0 0.000000 0.0 1.0 0.0 0.0
3 1.0 1.0 0.230769 1.0 0.107229 0.25 1.00 1.00 1.00 0.25 ... 0.75 0.25 0.006910 0.005682 0.0 0.006297 1.0 1.0 0.0 0.0
4 0.0 1.0 0.692308 1.0 0.036955 0.50 0.50 0.50 0.50 0.75 ... 0.50 0.50 0.000000 0.000000 1.0 0.000000 0.0 1.0 0.0 0.0

5 rows × 27 columns

In [48]:
# Feature selection, applying Select K Best and Chi2 to output the 15 most important features
from sklearn.feature_selection import SelectKBest, chi2

X = air_train_scaled.loc[:,air_train_scaled.columns!='satisfaction']
y = air_train_scaled[['satisfaction']]

selector = SelectKBest(chi2, k = 10)
selector.fit(X, y)
X_new = selector.transform(X)

features = (X.columns[selector.get_support(indices=True)])
features
Out[48]:
Index(['type_of_travel', 'inflight_wifi_service', 'online_boarding',
       'seat_comfort', 'inflight_entertainment', 'on_board_service',
       'leg_room_service', 'cleanliness', 'class_business', 'class_eco'],
      dtype='object')
In [49]:
selector.pvalues_
Out[49]:
array([6.69798296e-003, 6.51597103e-161, 2.74981007e-046, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 2.62274871e-016, 2.67143169e-249,
       5.00768922e-001, 2.65703013e-214, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 1.17553644e-202,
       4.62662048e-217, 7.43319643e-194, 0.00000000e+000, 9.00744872e-005,
       8.65788663e-006, 2.87418958e-005, 1.36828268e-045, 0.00000000e+000,
       0.00000000e+000, 4.45255856e-231])

Observations : With Chi-Square p-value as the selection criteria, many of the features picked are the different scored aspects of a customer experience, in addition to the reason for travel and the cabin class in which they are travelling.
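The raw `selector.pvalues_` array above is hard to read because it loses the column names. One way to pair each p-value with its feature and sort, sketched here on synthetic stand-in data since the notebook's `X` and `y` live in the session:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the scaled training split
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo = pd.DataFrame(MinMaxScaler().fit_transform(X_demo),
                      columns=[f'feat_{i}' for i in range(6)])

selector = SelectKBest(chi2, k=3).fit(X_demo, y_demo)

# Index the p-values by column name and sort, smallest (most significant) first
pvals = pd.Series(selector.pvalues_, index=X_demo.columns).sort_values()
print(pvals)
```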

Back to Top

5. Modelling¶

Modelling library imports¶

In [50]:
import sklearn
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import CategoricalNB
import xgboost
from xgboost import XGBClassifier

Model architecture¶

In [51]:
# Using the feature set selected by SelectKBest above

# Specifying target variable
target = ['satisfaction']

# Splitting into train and test
X_train = air_train_encoded[features].to_numpy()
X_test = air_test_encoded[features]
y_train = air_train_encoded[target].to_numpy()
y_test = air_test_encoded[target]
In [52]:
X_test.shape
Out[52]:
(23863, 10)

Model Activation Function¶

In [53]:
# Time scores and metrics imports
import time
from resource import getrusage, RUSAGE_SELF
from sklearn.metrics import accuracy_score, roc_auc_score, plot_confusion_matrix, plot_roc_curve, precision_score, recall_score
In [54]:
# Model activation and result plot function
def get_model_metrics(model, X_train, X_test, y_train, y_test):
   
    '''
    Model activation function, takes in model as a parameter and returns metrics as specified.
    
    Inputs: 
        model,  X_train, y_train, X_test, y_test
    
    Output: 
        Model output metrics, confusion matrix, ROC AUC curve
    '''
    
    # Mark of current time when model began running
    t0 = time.time()
    
    # Fit the model on the training data and run predictions on test data
    model.fit(X_train,  y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:,1]
    # Obtain training accuracy as a comparative metric using Sklearn's metrics package
    train_score = model.score(X_train, y_train)
    # Obtain testing accuracy as a comparative metric using Sklearn's metrics package
    accuracy = accuracy_score(y_test, y_pred)
    # Obtain precision from predictions using Sklearn's metrics package
    precision = precision_score(y_test, y_pred)
    # Obtain recall from predictions using Sklearn's metrics package
    recall = recall_score(y_test, y_pred)
    # Obtain ROC score from predictions using Sklearn's metrics package
    roc = roc_auc_score(y_test, y_pred_proba)
    # Obtain the time taken used to run the model, by subtracting the start time from the current time
    time_taken = time.time() - t0
    # Obtain peak memory usage of the process; note that ru_maxrss units are
    # platform-dependent (kilobytes on Linux, bytes on macOS), so this figure is approximate
    memory_used = int(getrusage(RUSAGE_SELF).ru_maxrss / 1024)

    # Outputting the metrics of the model performance
    print("Accuracy on Training = {}".format(train_score))
    print("Accuracy on Test = {} • Precision = {}".format(accuracy, precision))
    print("Recall = {} • ROC Area under Curve = {}".format(recall, roc))
    print("Time taken = {} seconds • Memory consumed = {} Bytes".format(time_taken, memory_used))

    # Plotting the confusion matrix of the model's predictive capabilities
    plot_confusion_matrix(model, X_test, y_test, cmap = plt.cm.Blues, normalize = 'all')
    # Plotting the ROC AUC curve of the model 
    plot_roc_curve(model, X_test, y_test)    
    plt.show()
    
    return model, train_score, accuracy, precision, recall, roc, time_taken, memory_used   

5.1 Model Candidate Pipelines¶

Logistic Regression Model¶

Parameters for Logistic Regression Model¶
In [55]:
LogisticRegression().get_params()
Out[55]:
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
Performing randomized search to obtain optimal model parameters¶
In [56]:
%%time
clf = LogisticRegression()

params = {'n_jobs': [0, 5, 10, 15, 20]}

rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)

# Parameter object to be passed through to function activation
params = rscv.best_params_

print("Best parameters:", params)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best parameters: {'n_jobs': 5}
CPU times: user 25.5 s, sys: 1.64 s, total: 27.1 s
Wall time: 23.9 s
Running model pipeline and obtaining performance metrics¶
In [57]:
model_lr = LogisticRegression(**params)
model_lr, train_lr, accuracy_lr, precision_lr, recall_lr, roc_lr, tt_lr, mu_lr = get_model_metrics(model_lr,
                                                                                                   X_train,
                                                                                                   X_test,
                                                                                                   y_train,
                                                                                                   y_test)
Accuracy on Training = 0.8776540165510324
Accuracy on Test = 0.873821397142019 • Precision = 0.8552837573385519
Recall = 0.8508712158084298 • ROC Area under Curve = 0.9450786859429265
Time taken = 1.0026297569274902 seconds • Memory consumed = 711520 Bytes

Random Forest Classifier¶

Parameters for Random Forest¶
In [58]:
RandomForestClassifier().get_params()
Out[58]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
Performing randomized search to obtain optimal model parameters¶
In [59]:
%%time
clf = RandomForestClassifier()

params = { 'max_depth': [5, 10, 15, 20, 25, 30],
           'max_leaf_nodes': [10, 20, 30, 40, 50],
           'min_samples_split': [2, 3, 4, 5]}

rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)

# Parameter object to be passed through to function activation
params = rscv.best_params_

print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'min_samples_split': 5, 'max_leaf_nodes': 40, 'max_depth': 30}
CPU times: user 1min 4s, sys: 1.84 s, total: 1min 6s
Wall time: 1min 6s
Running model pipeline and obtaining performance metrics¶
In [60]:
model_rf = RandomForestClassifier(**params)
model_rf, train_rf, accuracy_rf, precision_rf, recall_rf, roc_rf, tt_rf, mu_rf = get_model_metrics(model_rf,
                                                                                                   X_train,
                                                                                                   X_test,
                                                                                                   y_train,
                                                                                                   y_test)
Accuracy on Training = 0.926418958455237
Accuracy on Test = 0.9277542639232285 • Precision = 0.9168210628961482
Recall = 0.9152146403192836 • ROC Area under Curve = 0.9751138080512387
Time taken = 2.7000980377197266 seconds • Memory consumed = 735700 Bytes

Adaptive Boosting Classifier (AdaBoost)¶

Parameters for AdaBoost Classifier¶
In [61]:
AdaBoostClassifier().get_params()
Out[61]:
{'algorithm': 'SAMME.R',
 'base_estimator': None,
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': None}
Performing randomized search to obtain optimal model parameters¶
In [62]:
%%time
clf = AdaBoostClassifier()

params = { 'n_estimators': [25, 50, 75, 100, 125, 150],
           'learning_rate': [0.2, 0.4, 0.6, 0.8, 1.0]}

rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)

# Parameter object to be passed through to function activation
params = rscv.best_params_

print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'n_estimators': 125, 'learning_rate': 0.6}
CPU times: user 1min 29s, sys: 3.28 s, total: 1min 32s
Wall time: 1min 32s
Running model pipeline and obtaining performance metrics¶
In [63]:
model_ada = AdaBoostClassifier(**params)

# Saving output metrics
model_ada, train_ada, accuracy_ada, precision_ada, recall_ada, roc_ada, tt_ada, mu_ada = get_model_metrics(model_ada,
                                                                                                           X_train,
                                                                                                           X_test,
                                                                                                           y_train,
                                                                                                           y_test)
Accuracy on Training = 0.9074960294240575
Accuracy on Test = 0.9061727360348657 • Precision = 0.9004985044865403
Recall = 0.879197897400954 • ROC Area under Curve = 0.962524637370356
Time taken = 4.641080856323242 seconds • Memory consumed = 739332 Bytes

Categorical Naive Bayes¶

Parameters for Categorical Naive Bayes¶
In [64]:
CategoricalNB().get_params()
Out[64]:
{'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'min_categories': None}
Performing randomized search to obtain optimal model parameters¶
In [65]:
%%time
clf = CategoricalNB()

params = { 'alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
           'min_categories': [6, 8, 10]}

rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)

# Parameter object to be passed through to function activation
params = rscv.best_params_

print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'min_categories': 8, 'alpha': 0.0001}
CPU times: user 1.67 s, sys: 96.6 ms, total: 1.77 s
Wall time: 1.77 s
Running model pipeline and obtaining performance metrics¶
In [66]:
model_cnb = CategoricalNB(**params)

# Saving Output Metrics
model_cnb, train_cnb, accuracy_cnb, precision_cnb, recall_cnb, roc_cnb, tt_cnb, mu_cnb = get_model_metrics(model_cnb, 
                                                                                                           X_train, 
                                                                                                           X_test,
                                                                                                           y_train, 
                                                                                                           y_test)
Accuracy on Training = 0.8942572933210733
Accuracy on Test = 0.890122784226627 • Precision = 0.8703649917707426
Recall = 0.8751095103669814 • ROC Area under Curve = 0.9490493379166705
Time taken = 0.09247374534606934 seconds • Memory consumed = 751384 Bytes

Extreme Gradient Boosting Classifier (XGBoost)¶

Parameters for XGBoost Classifier¶
In [67]:
XGBClassifier().get_params()
Out[67]:
{'objective': 'binary:logistic',
 'use_label_encoder': False,
 'base_score': None,
 'booster': None,
 'callbacks': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'early_stopping_rounds': None,
 'enable_categorical': False,
 'eval_metric': None,
 'gamma': None,
 'gpu_id': None,
 'grow_policy': None,
 'importance_type': None,
 'interaction_constraints': None,
 'learning_rate': None,
 'max_bin': None,
 'max_cat_to_onehot': None,
 'max_delta_step': None,
 'max_depth': None,
 'max_leaves': None,
 'min_child_weight': None,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': None,
 'num_parallel_tree': None,
 'predictor': None,
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'sampling_method': None,
 'scale_pos_weight': None,
 'subsample': None,
 'tree_method': None,
 'validate_parameters': None,
 'verbosity': None}
Performing randomized search to obtain optimal model parameters¶
In [68]:
%%time
clf = XGBClassifier()

params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'n_estimators': [100, 500, 1000]}

rscv = RandomizedSearchCV(estimator = clf,
                         param_distributions = params,
                         scoring = 'f1',
                         n_iter = 10,
                         verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)

# Parameter object to be passed through to function activation
params = rscv.best_params_

print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.1}
CPU times: user 2h 11min 40s, sys: 3min 9s, total: 2h 14min 50s
Wall time: 8min 40s
Running model pipeline and obtaining performance metrics¶
In [69]:
model_xgb = XGBClassifier(**params)

# Saving Output Metrics
model_xgb, train_xgb, accuracy_xgb, precision_xgb, recall_xgb, roc_xgb, tt_xgb, mu_xgb = get_model_metrics(model_xgb,
                                                                                                           X_train,
                                                                                                           X_test,
                                                                                                           y_train,
                                                                                                           y_test)
Accuracy on Training = 0.9503886984870016
Accuracy on Test = 0.9438042157314671 • Precision = 0.9466893378675735
Recall = 0.9213472208702423 • ROC Area under Curve = 0.9876708141468592
Time taken = 5.126989126205444 seconds • Memory consumed = 1375012 Bytes

5.2 Comparing output performance of the model pipelines¶

In [70]:
# Collecting model data
training_scores = [train_lr, train_rf, train_ada, train_cnb, train_xgb]
accuracy = [accuracy_lr, accuracy_rf, accuracy_ada, accuracy_cnb, accuracy_xgb]
roc_scores = [roc_lr, roc_rf, roc_ada, roc_cnb, roc_xgb]
precision = [precision_lr, precision_rf, precision_ada, precision_cnb, precision_xgb]
recall = [recall_lr, recall_rf, recall_ada, recall_cnb, recall_xgb]
time_scores = [tt_lr, tt_rf, tt_ada, tt_cnb, tt_xgb]
memory_scores = [mu_lr, mu_rf, mu_ada, mu_cnb, mu_xgb]

model_data = {'Model': ['Logistic Regression', 'Random Forest', 'Adaptive Boost',
                       'Categorical Bayes', 'Extreme Gradient Boost'],
            'Accuracy on Training' : training_scores,
            'Accuracy on Test' : accuracy,
            'ROC AUC Score' : roc_scores,
            'Precision' : precision,
            'Recall' : recall,
            'Time Elapsed (seconds)' : time_scores,
            'Memory Consumed (bytes)': memory_scores}

model_data = pd.DataFrame(model_data)
model_data
Out[70]:
Model Accuracy on Training Accuracy on Test ROC AUC Score Precision Recall Time Elapsed (seconds) Memory Consumed (bytes)
0 Logistic Regression 0.877654 0.873821 0.945079 0.855284 0.850871 1.002630 711520
1 Random Forest 0.926419 0.927754 0.975114 0.916821 0.915215 2.700098 735700
2 Adaptive Boost 0.907496 0.906173 0.962525 0.900499 0.879198 4.641081 739332
3 Categorical Bayes 0.894257 0.890123 0.949049 0.870365 0.875110 0.092474 751384
4 Extreme Gradient Boost 0.950389 0.943804 0.987671 0.946689 0.921347 5.126989 1375012
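Rather than eyeballing the table, the finalist can be picked programmatically. A sketch on a stand-in frame that reproduces only the two columns the decision needs:

```python
import pandas as pd

# Hypothetical stand-in for model_data with a subset of rows
model_data = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Extreme Gradient Boost'],
    'ROC AUC Score': [0.945079, 0.975114, 0.987671],
})

# Select the model with the highest ROC AUC
best = model_data.loc[model_data['ROC AUC Score'].idxmax(), 'Model']
print(best)  # Extreme Gradient Boost
```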
In [71]:
# Plotting each model's performance scores vs time elapsed
plt.rcParams["figure.figsize"] = (25,15)

ax1 = model_data.plot.bar(x = 'Model', y = ["Accuracy on Training", "Accuracy on Test", "ROC AUC Score", 
                                            "Precision", "Recall"], 
                          cmap = 'coolwarm')
ax1.legend()

ax1.set_title("Model Comparison", fontsize = 18)
ax1.set_xlabel('Model', fontsize = 14)
ax1.set_ylabel('Result', fontsize = 14, color = 'Black');
In [72]:
# Plotting each model's memory consumption
ax1 = model_data.plot.bar(x = 'Model', y = 'Memory Consumed (bytes)')

ax1.set_title("Resources consumed per model (bytes)", fontsize = 18)
ax2 = model_data['Time Elapsed (seconds)'].plot(secondary_y = True, color = 'Gold', linewidth = 4, marker = 's')
ax1.set_xlabel('Model', fontsize = 14)
ax2.set_ylabel('Time Elapsed (seconds)', fontsize = 14, color = 'Gold', fontweight = 'bold')
ax1.set_ylabel('Memory Consumed (bytes)', fontsize = 14, color = 'Black');

Observations : If I were packaging a model pipeline to recommend to a client at this point, I would promote the Extreme Gradient Boosting (XGBoost) classifier. Although it was not the most efficient model in terms of runtime and memory, it delivered the strongest predictive performance across accuracy, precision, recall, and ROC AUC.

Feature importance of finalist pipeline - Extreme Gradient Boosting Classifier (XGBoost)¶

Metric: Weight

In [73]:
from xgboost import plot_importance

model_xgb.get_booster().feature_names = ['type_of_travel', 'inflight_wifi_service', 'online_boarding',
       'seat_comfort', 'inflight_entertainment', 'on_board_service',
       'leg_room_service', 'cleanliness', 'class_business', 'class_eco']

plot_importance(model_xgb)
plt.show()

Observations : In terms of feature weight (the number of times a feature was used to split the data), our model used the seat comfort feature most often to split the data across all trees, followed by online boarding, inflight entertainment, on-board service quality, leg room, inflight wi-fi, and cleanliness.

Back to Top

6. Packaging and Explainability¶

Packaging the model¶

Saving the model with pickle¶

In [74]:
import pickle

# Saving test model. 
pickle.dump(model_xgb, open('./Models/model_xgb.pkl', 'wb'))

Loading the model with pickle¶

In [75]:
pickled_model = pickle.load(open('./Models/model_xgb.pkl', 'rb'))
In [76]:
X_test
Out[76]:
type_of_travel inflight_wifi_service online_boarding seat_comfort inflight_entertainment on_board_service leg_room_service cleanliness class_business class_eco
0 1 4.0 3.0 2.0 4.0 4.0 4.0 4.0 0 1
1 1 0.0 3.0 4.0 3.0 3.0 3.0 4.0 1 0
4 1 1.0 0.0 1.0 1.0 1.0 1.0 3.0 0 1
5 1 2.0 4.0 2.0 4.0 3.0 2.0 4.0 0 1
6 1 4.0 4.0 4.0 4.0 4.0 4.0 2.0 1 0
... ... ... ... ... ... ... ... ... ... ...
25971 1 2.0 2.0 3.0 3.0 2.0 1.0 3.0 1 0
25972 1 3.0 3.0 3.0 3.0 3.0 4.0 3.0 1 0
25973 0 1.0 0.0 1.0 1.0 3.0 2.0 1.0 0 1
25974 1 2.0 3.0 3.0 3.0 2.0 1.0 3.0 1 0
25975 0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0 1

23863 rows × 10 columns

In [77]:
pickled_model.predict(X_test)
Out[77]:
array([1, 1, 0, ..., 0, 1, 0])

Model Explainability using SHAP¶

In [78]:
# !pip install shap
import shap
In [79]:
explainer = shap.Explainer(model_xgb, feature_names = features)
In [80]:
# Explain feature importance in producing model output from base value to actual output
shap_values = explainer(X_train)
In [81]:
shap_values.values[0]
Out[81]:
array([-3.6781335 , -3.247354  , -1.294865  ,  0.16754012, -0.31717607,
        0.04861424, -0.12336768,  0.30478886, -0.51065814,  0.04983278],
      dtype=float32)

Feature Importance¶

In [82]:
shap.plots.bar(shap_values)

Global feature explainability¶

In [83]:
%%time
shap.initjs()
shap.summary_plot(shap_values, X_train, class_names=model_xgb.classes_)
CPU times: user 10.3 s, sys: 154 ms, total: 10.4 s
Wall time: 10.4 s

Observations : Considering the mean SHAP value as our metric for feature importance, we can observe that in-flight wi-fi service is the most impactful feature of our data, followed closely by travel type and online boarding.

Looking at the SHAP summary plot (or beeswarm), which illustrates the feature value of each individual observation and its impact on the model output, we get a more complete picture of overall feature impact. The features are again ordered by their impact on prediction, but the plot also shows how higher and lower values of each feature push the result.

For almost every feature, high feature values have a positive impact on prediction while low feature values have a negative impact on prediction.
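The bar plot's ranking is simply the mean absolute SHAP value per feature, which can be reproduced directly from `shap_values.values` with NumPy. Sketched here on a small stand-in attribution matrix rather than the notebook's actual SHAP output:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for shap_values.values: (n_samples, n_features)
values = np.array([[ 0.5, -1.2,  0.1],
                   [-0.4,  0.9, -0.2],
                   [ 0.6, -1.0,  0.3]])
feature_names = ['inflight_wifi_service', 'type_of_travel', 'online_boarding']

# Mean |SHAP| per feature is the global importance metric behind shap.plots.bar
mean_abs = pd.Series(np.abs(values).mean(axis=0), index=feature_names)
print(mean_abs.sort_values(ascending=False))
```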

Explaining inflight_wifi_service feature impact¶

In [84]:
shap.plots.scatter(shap_values[:, "inflight_wifi_service"], color=shap_values)

Observations : I wanted to take a closer look into the in-flight wi-fi feature. One, because it is the feature that had the most impact in determining model output; but also particularly because it was a feature that drew my attention during exploratory analysis, seeing as how certain values had high impact driving satisfaction and dissatisfaction among different travel classes and travel types.

To refresh - for travel type I encoded personal travel as 0 and business travel as 1. The manner in which the model uses this feature almost directly mirrors the data itself: high wi-fi ratings from leisure travelers push strongly toward a prediction of satisfaction, while low ratings push toward dissatisfaction.

It can also be observed that for business travelers, regardless of how they rated the wi-fi service, a portion are pushed toward satisfaction (positive SHAP value over negative). However, the business travelers with negative SHAP values - those pushed toward dissatisfaction - gave low marks to their experience with wi-fi on board.

Explaining online_boarding feature impact¶

In [85]:
shap.plots.scatter(shap_values[:, "online_boarding"], color=shap_values)

Observations : I also wanted to take a quick look at the SHAP impact of online boarding. This is another feature which showed high impact potential, regardless of travel reason or class.

Similarly to what we observed in exploratory analysis - low marks of the boarding process, regardless of travel reason had a negative impact on the model output, pointing towards passenger dissatisfaction. Whereas high remarks were consistent with customers who were more or less satisfied with their overall travel experience.

Back to Top

7. Conclusions¶

At this point I would be able to confidently present to the client that we have data-informed recommendations on where they can improve their customer service, evidenced by an interpretable XGBoost machine learning pipeline that achieves 94.4% accuracy and 94.7% precision on held-out test data.

To summarize, it is crucial to focus on how efficiently passengers get on the plane and how they are treated while on board.

There is a fundamental difference in how customers react to certain service qualities depending on their reason for travel. Most important among these service aspects is the quality of in-flight wi-fi service on board.

Regardless of cabin or travel purpose, strong wi-fi capability is a huge opportunity for increasing customer satisfaction and expanding competitive advantage. Increasing availability - for example, making wi-fi complimentary in all cabins - would draw in customers who want readily available video and music streaming, while improving capability would attract business travelers who know they can stay productive in the air.

The efficiency of getting people on board the plane is also something to improve upon. This isn't specific to one cabin class or purpose of travel - it's universal. Passengers who gave higher marks for boarding tended to be satisfied with their travel experience, while passengers who gave mediocre or low marks were not.

Secondary to those features, there's also opportunity to grow in terms of the quality of in-flight entertainment, food and beverage, seat comfort, cabin cleanliness, and leg room.

I think it's important that the airline focus on improving these key service aspects to improve customer satisfaction, customer retention, and to establish or expand their competitive advantage against other carriers.

Chris Bacani
Data Scientist Apprentice, IBM Corporation
2022

Back to Top
